The McDonald's Menu data set

In this example we use the nutritional features of all items on the McDonald's menu to perform a cluster analysis with k-means. The data set can be downloaded from Kaggle.

In total there are 260 products and 24 features/variables. The data looks like this:

library(tidyverse)
mcd_menu <- read_csv("mcd-menu.csv")
head(mcd_menu)
## # A tibble: 6 x 24
##   Category Item  `Serving Size` Calories `Calories from … `Total Fat`
##   <chr>    <chr> <chr>             <dbl>            <dbl>       <dbl>
## 1 Breakfa… Egg … 4.8 oz (136 g)      300              120          13
## 2 Breakfa… Egg … 4.8 oz (135 g)      250               70           8
## 3 Breakfa… Saus… 3.9 oz (111 g)      370              200          23
## 4 Breakfa… Saus… 5.7 oz (161 g)      450              250          28
## 5 Breakfa… Saus… 5.7 oz (161 g)      400              210          23
## 6 Breakfa… Stea… 6.5 oz (185 g)      430              210          23
## # … with 18 more variables: `Total Fat (% Daily Value)` <dbl>, `Saturated
## #   Fat` <dbl>, `Saturated Fat (% Daily Value)` <dbl>, `Trans Fat` <dbl>,
## #   Cholesterol <dbl>, `Cholesterol (% Daily Value)` <dbl>, Sodium <dbl>,
## #   `Sodium (% Daily Value)` <dbl>, Carbohydrates <dbl>, `Carbohydrates (%
## #   Daily Value)` <dbl>, `Dietary Fiber` <dbl>, `Dietary Fiber (% Daily
## #   Value)` <dbl>, Sugars <dbl>, Protein <dbl>, `Vitamin A (% Daily
## #   Value)` <dbl>, `Vitamin C (% Daily Value)` <dbl>, `Calcium (% Daily
## #   Value)` <dbl>, `Iron (% Daily Value)` <dbl>

Libraries that you need

Besides the tidyverse, we need quite a few libraries. We will use:

library(cluster)
library(factoextra)
library(tidymodels)
library(plotly)
library(janitor)
library(GGally)

# We could think about transforming the serving size, i.e. turning it into a numeric value in grams,
# but what do we do with the beverages, coffee etc.?

# Convert the column names to snake_case (e.g. `Total Fat` becomes total_fat)
mcd_menu <- mcd_menu %>%
  janitor::clean_names()
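One way to act on the serving-size idea from the comment above is to pull out the gram value where one exists and leave everything else as NA. This is only a sketch: the helper name extract_grams and the new column name serving_size_g are made up for illustration, and the second test string is a made-up beverage-style value without grams.

```r
# Hypothetical helper: extract the number inside "(... g)" from a serving-size string.
# Strings without a gram value (e.g. many beverages) are returned as NA.
extract_grams <- function(x) {
  pattern <- "\\((\\d+(?:\\.\\d+)?)\\s*g\\)"
  has_grams <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_real_, length(x))
  out[has_grams] <- as.numeric(
    sub(paste0(".*", pattern, ".*"), "\\1", x[has_grams], perl = TRUE)
  )
  out
}

# First string is taken from the head() output above, second one is illustrative.
extract_grams(c("4.8 oz (136 g)", "16 fl oz cup"))  # returns 136 and NA

# Possible use on the cleaned data (column name serving_size comes from clean_names()):
# mcd_menu <- mcd_menu %>% mutate(serving_size_g = extract_grams(serving_size))
```

Keeping the beverages as NA avoids mixing grams and fluid volumes in one column. Whether that is acceptable depends on what you want to do with the variable later.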

Step 1: Exploratory Data Analysis with GGally

How are the nutritional values related to each other? Which categories contain the items with the most sugar, or the most calories?
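A possible starting point for these questions is sketched below. The two function names are made up for this example, and the choice of columns is an assumption. The column names are the ones clean_names() should produce for this data.

```r
library(dplyr)

# Mean sugar and calories per menu category, highest sugar first.
summarise_by_category <- function(menu) {
  menu %>%
    group_by(category) %>%
    summarise(
      mean_sugars = mean(sugars),
      mean_calories = mean(calories),
      .groups = "drop"
    ) %>%
    arrange(desc(mean_sugars))
}

# Pairwise scatterplots, densities and correlations for a handful of columns.
plot_nutrient_pairs <- function(menu,
                                cols = c("calories", "total_fat", "carbohydrates",
                                         "sugars", "protein")) {
  GGally::ggpairs(menu, columns = cols)
}

# Usage, once mcd_menu has been loaded and cleaned as above:
# summarise_by_category(mcd_menu)
# plot_nutrient_pairs(mcd_menu)
```

Restricting ggpairs to a few columns keeps the plot matrix readable. With all numeric variables at once it becomes very crowded.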

Step 2: Scaling

We want to scale the data so that all features (nutritional characteristics) are on the same scale, i.e. have a mean of 0 and a variance of 1. For this we use across within mutate. Be aware that scale returns a matrix while we need a plain numeric vector, hence the as.numeric.

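# Standardise every numeric column; character columns (category, item, ...) are left untouched
mcd_menu_scaled <- mcd_menu %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))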
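To see what scale does to a single column, here is a small check on the six calorie values from the head() output above. It is just an illustration in base R, not part of the analysis pipeline.

```r
# Calories of the first six menu items, copied from the head() output above
x <- c(300, 250, 370, 450, 400, 430)

# scale() returns a one-column matrix, so we flatten it to a numeric vector
is.matrix(scale(x))
z <- as.numeric(scale(x))

# Scaling is the usual z-score: subtract the mean, divide by the standard deviation
all.equal(z, (x - mean(x)) / sd(x))

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
mean(z)
sd(z)
```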

Step 3: What is the optimal value of k?

We can use the function fviz_nbclust from the package factoextra to produce two diagnostic plots: one with the total within-cluster sum of squares and one with the average silhouette width. Both suggest that k = 4 is a good choice here.
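The two diagnostic plots can be drawn with fviz_nbclust, the factoextra function for choosing the number of clusters. The sketch below assumes that only the numeric columns of the scaled data go into the clustering. The wrapper name plot_k_diagnostics and the seed are made up for this example.

```r
library(dplyr)
library(factoextra)

# Returns both diagnostic plots for k = 1, ..., k_max as a named list.
plot_k_diagnostics <- function(features, k_max = 10) {
  list(
    # Elbow plot: total within-cluster sum of squares for each k
    wss = fviz_nbclust(features, kmeans, method = "wss", k.max = k_max),
    # Average silhouette width for each k
    silhouette = fviz_nbclust(features, kmeans, method = "silhouette", k.max = k_max)
  )
}

# Usage with the scaled menu data from Step 2:
# set.seed(123)  # kmeans starts from random centres
# plots <- mcd_menu_scaled %>%
#   select(where(is.numeric)) %>%
#   plot_k_diagnostics()
# plots$wss
# plots$silhouette
```

In the elbow plot you look for the k after which the curve flattens. In the silhouette plot you look for the k with the highest average width.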

Step 4: Interactive plot

After performing kmeans with k=4 we want to have a look at the result. Let’s create an interactive plot that allows us to explore the result in detail. Note that we only plot Calories and Sugar here. Other variable combinations would be possible. Look at cluster 3. Isn’t it fun that the salads form a cluster of their own?
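A minimal sketch of this step could look as follows. The function names, the seed and nstart = 25 are assumptions made for this example, and clustering on all numeric columns is a guess at the setup used here. Cluster numbers depend on the random start, so the salad cluster need not come out as number 3 when you run it.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Run k-means on the numeric columns and attach the cluster label to each item.
cluster_menu <- function(menu_scaled, k = 4, seed = 123) {
  set.seed(seed)
  fit <- menu_scaled %>%
    select(where(is.numeric)) %>%
    kmeans(centers = k, nstart = 25)
  menu_scaled %>%
    mutate(cluster = factor(fit$cluster))
}

# Interactive scatterplot of calories vs. sugars; hovering shows the item name.
plot_clusters <- function(clustered) {
  p <- ggplot(clustered, aes(x = calories, y = sugars, colour = cluster, text = item)) +
    geom_point()
  ggplotly(p, tooltip = "text")
}

# Usage with the scaled menu data from Step 2:
# mcd_menu_scaled %>% cluster_menu(k = 4) %>% plot_clusters()
```

To plot another pair of variables, swap calories and sugars in aes() for any other numeric columns.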